[RFC] diff: add diff.<driver>.process for external hunk providers#2120
[RFC] diff: add diff.<driver>.process for external hunk providers#2120mmontalbo wants to merge 6 commits into
diff.<driver>.process for external hunk providers#2120Conversation
|
/preview |
|
Preview email sent as pull.2120.git.1779415095.gitgitgadget@gmail.com |
diff.<driver>.process for external hunk providers
|
/preview |
|
Preview email sent as pull.2120.git.1779415483.gitgitgadget@gmail.com |
dfcb1c3 to
8c7359b
Compare
|
/submit |
|
Submitted as pull.2120.git.1779415884.gitgitgadget@gmail.com To fetch this version into To fetch this version to local tag |
|
Junio C Hamano wrote on the Git mailing list (how to reply to this email): "Michael Montalbo via GitGitGadget" <gitgitgadget@gmail.com> writes:
> This series adds diff.<driver>.process, a long-running subprocess protocol
> that lets external tools provide hunks to git's diff and blame pipelines.
>
> Over the past 18 years, git's diff pipeline accumulated many features that
> operate on hunks: word diff, function context, color-moved, indent
> heuristic, blame. External tools can replace the pipeline entirely
> (diff.<driver>.command) or select among builtin algorithms
> (diff.<driver>.algorithm), but there is no way for a tool to provide
> line-change information into the pipeline. Tools that understand code
> structure (tree-sitter parsers, format-aware analyzers, tools like
> Difftastic and Mergiraf) must bypass git's pipeline and lose access to
> everything downstream.
>
> The protocol follows filter.<driver>.process: pkt-line over stdin/stdout,
> capability negotiation, one tool invocation per git command. The tool
> receives file pairs and returns hunk descriptors that git feeds into the
> standard xdiff pipeline. All output features work normally.
>
> Zero hunks with status=success means the tool considers the files
> equivalent. git diff shows no output for the file, and git blame skips the
> commit, attributing lines to earlier commits.
>
> On error or tool crash, git falls back silently to the builtin diff
> algorithm. The feature is opt-in via diff.<driver>.process and
> .gitattributes; unconfigured files are unaffected.
>
> The series includes git diff-process-normalize, a built-in tool that
> compares files line by line ignoring whitespace (same logic as "git diff -w"
> via xdiff_compare_lines):
Interesting.
If the goal is purely to normalize content before comparison
(e.g. stripping comments or canonicalizing formatting), we already
have the `textconv` mechanism. While `textconv` is a "one-shot"
per-file process, it is significantly simpler.
I suspect, however, that the primary focus here is to allow external
tools to provide structural alignment (e.g. for AST- aware diffs
like Difftastic or Mergiraf) without losing the original content in
the display. Unlike `textconv`, which transforms the text the user
sees, this protocol lets the display remain identical to the source
while using a custom engine for the line-matching logic.
If that is the intent, it should be stated more explicitly in the
documentation and commit messages. The "whitespace-normalize"
demonstration in [PATCH 5/5] is misleading because it's exactly the
case where `textconv` would be sufficient.
I am afraid that the use of a long-running subprocess for every
diff/blame invocation adds significant complexity and overhead. In
particular, wouldn't the `blame` implementation performs a
round-trip to the subprocess for every commit in the history? Even
with a persistent process, the overhead of serializing and
deserializing the entire file content twice (old and new) for every
commit could be prohibitive for large files or deep histories.
So, I dunno. |
| @@ -124,7 +124,12 @@ int xdi_diff(mmfile_t *mf1, mmfile_t *mf2, xpparam_t const *xpp, xdemitconf_t co | |||
| if (mf1->size > MAX_XDIFF_SIZE || mf2->size > MAX_XDIFF_SIZE) | |||
There was a problem hiding this comment.
Junio C Hamano wrote on the Git mailing list (how to reply to this email):
"Michael Montalbo via GitGitGadget" <gitgitgadget@gmail.com> writes:
> +/*
> + * Populate the changed[] arrays from externally supplied hunks,
> + * bypassing the diff algorithm. Validates that hunks are in order,
> + * non-overlapping, and within bounds.
> + *
> + * Returns 0 on success, -1 on validation failure.
> + */
> +static int xdl_populate_hunks_from_external(xdfenv_t *xe,
> + const struct xdl_hunk *hunks,
> + size_t nr_hunks)
> +{
> + size_t i;
> + long j, prev_old_end = 0, prev_new_end = 0;
> + long total_old = 0, total_new = 0;
> +
> + /*
> + * Clear changed[] arrays. xdl_prepare_env() may have dirtied
> + * them via xdl_cleanup_records(). The allocation is nrec + 2
> + * elements; changed points one past the start (see xprepare.c).
> + */
> + memset(xe->xdf1.changed - 1, 0,
> + (xe->xdf1.nrec + 2) * sizeof(bool));
> + memset(xe->xdf2.changed - 1, 0,
> + (xe->xdf2.nrec + 2) * sizeof(bool));
This, especially the starting offset of -1, looks horrible. The
internal layout of xdfenv_t might happen to match the way the above
code expects, which is how xdl_prepare_ctx() may have give you, but
it somehow feels brittle. I guess the assumption that changed[]
does not point at the beginning of the allocated area (e.g., it is a
no-no to free(xe->xdf1.changed) or realloc() it) is so pervasive that
it cannot be helped. Sigh.
> int xdl_diff(mmfile_t *mf1, mmfile_t *mf2, xpparam_t const *xpp,
> xdemitconf_t const *xecfg, xdemitcb_t *ecb) {
> xdchange_t *xscr;
> xdfenv_t xe;
> emit_func_t ef = xecfg->hunk_func ? xdl_call_hunk_func : xdl_emit_diff;
>
> - if (xdl_do_diff(mf1, mf2, xpp, &xe) < 0) {
> -
> - return -1;
> + if (xpp->external_hunks) {
> + if (xdl_prepare_env(mf1, mf2, xpp, &xe) < 0)
> + return -1;
> + if (xdl_populate_hunks_from_external(&xe,
> + xpp->external_hunks,
> + xpp->external_hunks_nr) < 0) {
> + /*
> + * Invalid external hunks; fall back to the
> + * builtin diff algorithm. Re-runs
> + * xdl_prepare_env() via xdl_do_diff().
> + */
> + xdl_free_env(&xe);
> + if (xdl_do_diff(mf1, mf2, xpp, &xe) < 0)
> + return -1;
If the external tool keeps sending bogus hunks, silently falling
back to what we would have done if there weren't any external stuff
may be necessary to pleasantly keep using Git, but two and a half
short comments here.
(1) "What we would have done" is exactly the same as what appears
in the corresponding "else" block. Can we make sure that we do
not have to keep updating both copies in the future with some
code rearrangement?
(2) The writer of the external tool may want to see some trace of
warning under certain flags when a failure of the tool forces
the receiving end to fallback.
(3) If the tool throws too many broken replies, perhaps we want to
disable it automatically?
> + }
> + } else {
> + if (xdl_do_diff(mf1, mf2, xpp, &xe) < 0)
> + return -1;
> }
> +
> if (xdl_change_compact(&xe.xdf1, &xe.xdf2, xpp->flags) < 0 ||
> xdl_change_compact(&xe.xdf2, &xe.xdf1, xpp->flags) < 0 ||
> xdl_build_script(&xe, &xscr) < 0) {There was a problem hiding this comment.
Michael Montalbo wrote on the Git mailing list (how to reply to this email):
On Thu, May 21, 2026 at 10:29 PM Junio C Hamano <gitster@pobox.com> wrote:
>
> "Michael Montalbo via GitGitGadget" <gitgitgadget@gmail.com> writes:
>
> > +/*
> > + * Populate the changed[] arrays from externally supplied hunks,
> > + * bypassing the diff algorithm. Validates that hunks are in order,
> > + * non-overlapping, and within bounds.
> > + *
> > + * Returns 0 on success, -1 on validation failure.
> > + */
> > +static int xdl_populate_hunks_from_external(xdfenv_t *xe,
> > + const struct xdl_hunk *hunks,
> > + size_t nr_hunks)
> > +{
> > + size_t i;
> > + long j, prev_old_end = 0, prev_new_end = 0;
> > + long total_old = 0, total_new = 0;
> > +
> > + /*
> > + * Clear changed[] arrays. xdl_prepare_env() may have dirtied
> > + * them via xdl_cleanup_records(). The allocation is nrec + 2
> > + * elements; changed points one past the start (see xprepare.c).
> > + */
> > + memset(xe->xdf1.changed - 1, 0,
> > + (xe->xdf1.nrec + 2) * sizeof(bool));
> > + memset(xe->xdf2.changed - 1, 0,
> > + (xe->xdf2.nrec + 2) * sizeof(bool));
>
> This, especially the starting offset of -1, looks horrible. The
> internal layout of xdfenv_t might happen to match the way the above
> code expects, which is how xdl_prepare_ctx() may have give you, but
> it somehow feels brittle. I guess the assumption that changed[]
> does not point at the beginning of the allocated area (e.g., it is a
> no-no to free(xe->xdf1.changed) or realloc() it) is so pervasive that
> it cannot be helped. Sigh.
>
Agreed it is ugly. I wanted to make sure the entire changed[] including
sentinels were clear as a defensive measure for downstream callers
(xdl_change_compact). I agree this results in something that is ugly
and brittle, but in the end I thought it was superior to relying on the
fact that upstream zeroes the entire changed[] array. Maybe if the
comment was more explicit about why this is happening it would be
helpful?
/*
* Clear changed[] arrays including sentinels.
* xdl_prepare_env() may have dirtied them via
* xdl_cleanup_records(), and xdl_change_compact() reads
* the sentinel at changed[-1] during backward scans.
*/
> > int xdl_diff(mmfile_t *mf1, mmfile_t *mf2, xpparam_t const *xpp,
> > xdemitconf_t const *xecfg, xdemitcb_t *ecb) {
> > xdchange_t *xscr;
> > xdfenv_t xe;
> > emit_func_t ef = xecfg->hunk_func ? xdl_call_hunk_func : xdl_emit_diff;
> >
> > - if (xdl_do_diff(mf1, mf2, xpp, &xe) < 0) {
> > -
> > - return -1;
> > + if (xpp->external_hunks) {
> > + if (xdl_prepare_env(mf1, mf2, xpp, &xe) < 0)
> > + return -1;
> > + if (xdl_populate_hunks_from_external(&xe,
> > + xpp->external_hunks,
> > + xpp->external_hunks_nr) < 0) {
> > + /*
> > + * Invalid external hunks; fall back to the
> > + * builtin diff algorithm. Re-runs
> > + * xdl_prepare_env() via xdl_do_diff().
> > + */
> > + xdl_free_env(&xe);
> > + if (xdl_do_diff(mf1, mf2, xpp, &xe) < 0)
> > + return -1;
>
> If the external tool keeps sending bogus hunks, silently falling
> back to what we would have done if there weren't any external stuff
> may be necessary to pleasantly keep using Git, but two and a half
> short comments here.
>
> (1) "What we would have done" is exactly the same as what appears
> in the corresponding "else" block. Can we make sure that we do
> not have to keep updating both copies in the future with some
> code rearrangement?
>
How about something like this:
if (xpp->external_hunks) {
if (xdl_prepare_env(mf1, mf2, xpp, &xe) < 0)
return -1;
if (xdl_populate_hunks_from_external(&xe,
xpp->external_hunks,
xpp->external_hunks_nr) == 0)
goto diff_done;
xdl_free_env(&xe);
}
if (xdl_do_diff(mf1, mf2, xpp, &xe) < 0)
return -1;
diff_done:
> (2) The writer of the external tool may want to see some trace of
> warning under certain flags when a failure of the tool forces
> the receiving end to fallback.
>
In diff.c how about we emit a warning rather than a trace on
fallback:
warning(_("diff process failed for '%s',"
" falling back to builtin diff"),
name_a);
> (3) If the tool throws too many broken replies, perhaps we want to
> disable it automatically?
>
For the RFC I wanted to keep it simple, but I definitely agree. A configurable
failure policy makes a lot of sense to me (e.g., disable after N failures).
> > + }
> > + } else {
> > + if (xdl_do_diff(mf1, mf2, xpp, &xe) < 0)
> > + return -1;
> > }
> > +
> > if (xdl_change_compact(&xe.xdf1, &xe.xdf2, xpp->flags) < 0 ||
> > xdl_change_compact(&xe.xdf2, &xe.xdf1, xpp->flags) < 0 ||
> > xdl_build_script(&xe, &xscr) < 0) {There was a problem hiding this comment.
Junio C Hamano wrote on the Git mailing list (how to reply to this email):
Michael Montalbo <mmontalbo@gmail.com> writes:
>> > + * Clear changed[] arrays. xdl_prepare_env() may have dirtied
>> > + * them via xdl_cleanup_records(). The allocation is nrec + 2
>> > + * elements; changed points one past the start (see xprepare.c).
>> > + */
>> > + memset(xe->xdf1.changed - 1, 0,
>> > + (xe->xdf1.nrec + 2) * sizeof(bool));
>> > + memset(xe->xdf2.changed - 1, 0,
>> > + (xe->xdf2.nrec + 2) * sizeof(bool));
>>
>> This, especially the starting offset of -1, looks horrible. The
>> internal layout of xdfenv_t might happen to match the way the above
>> code expects, which is how xdl_prepare_ctx() may have give you, but
>> it somehow feels brittle. I guess the assumption that changed[]
>> does not point at the beginning of the allocated area (e.g., it is a
>> no-no to free(xe->xdf1.changed) or realloc() it) is so pervasive that
>> it cannot be helped. Sigh.
>>
>
> Agreed it is ugly. I wanted to make sure the entire changed[] including
> sentinels were clear as a defensive measure for downstream callers
> (xdl_change_compact). I agree this results in something that is ugly
> and brittle, but in the end I thought it was superior to relying on the
> fact that upstream zeroes the entire changed[] array. Maybe if the
> comment was more explicit about why this is happening it would be
> helpful?
Perhaps make these memset() into calls to a helper function that is
defined in xdiff/xprepare.c with a descriptive name and placed near
where xdl_prepare_ctx() is. That way, the patch in question does
not even have to expose the strangeness of changed[] (i.e., it has 2
more elements than it would normally contain to make the memory
region for changed[-1] and changed[N] valid, and freeing it requires
free(changed-1)) to the code path. It only needs to say "Hey, I am
clearing changed[] arrays because of XXX" without having to say "by
the way, the memory layout of changed[] is strange this way", the
latter of which is not exactly of interest for readers of this code.
> /*
> * Clear changed[] arrays including sentinels.
> * xdl_prepare_env() may have dirtied them via
> * xdl_cleanup_records(), and xdl_change_compact() reads
> * the sentinel at changed[-1] during backward scans.
> */
And this belongs in xdiff/xprepare.c near that new helper function.There was a problem hiding this comment.
Michael Montalbo wrote on the Git mailing list (how to reply to this email):
On Sun, May 24, 2026 at 1:50 AM Junio C Hamano <gitster@pobox.com> wrote:
>
> Michael Montalbo <mmontalbo@gmail.com> writes:
>
> >> > + * Clear changed[] arrays. xdl_prepare_env() may have dirtied
> >> > + * them via xdl_cleanup_records(). The allocation is nrec + 2
> >> > + * elements; changed points one past the start (see xprepare.c).
> >> > + */
> >> > + memset(xe->xdf1.changed - 1, 0,
> >> > + (xe->xdf1.nrec + 2) * sizeof(bool));
> >> > + memset(xe->xdf2.changed - 1, 0,
> >> > + (xe->xdf2.nrec + 2) * sizeof(bool));
> >>
> >> This, especially the starting offset of -1, looks horrible. The
> >> internal layout of xdfenv_t might happen to match the way the above
> >> code expects, which is how xdl_prepare_ctx() may have give you, but
> >> it somehow feels brittle. I guess the assumption that changed[]
> >> does not point at the beginning of the allocated area (e.g., it is a
> >> no-no to free(xe->xdf1.changed) or realloc() it) is so pervasive that
> >> it cannot be helped. Sigh.
> >>
> >
> > Agreed it is ugly. I wanted to make sure the entire changed[] including
> > sentinels were clear as a defensive measure for downstream callers
> > (xdl_change_compact). I agree this results in something that is ugly
> > and brittle, but in the end I thought it was superior to relying on the
> > fact that upstream zeroes the entire changed[] array. Maybe if the
> > comment was more explicit about why this is happening it would be
> > helpful?
>
> Perhaps make these memset() into calls to a helper function that is
> defined in xdiff/xprepare.c with a descriptive name and placed near
> where xdl_prepare_ctx() is. That way, the patch in question does
> not even have to expose the strangeness of changed[] (i.e., it has 2
> more elements than it would normally contain to make the memory
> region for changed[-1] and changed[N] valid, and freeing it requires
> free(changed-1)) to the code path. It only needs to say "Hey, I am
> clearing changed[] arrays because of XXX" without having to say "by
> the way, the memory layout of changed[] is strange this way", the
> latter of which is not exactly of interest for readers of this code.
>
> > /*
> > * Clear changed[] arrays including sentinels.
> > * xdl_prepare_env() may have dirtied them via
> > * xdl_cleanup_records(), and xdl_change_compact() reads
> > * the sentinel at changed[-1] during backward scans.
> > */
>
> And this belongs in xdiff/xprepare.c near that new helper function.
That sounds a lot nicer. Will update.fd6978a to
9818598
Compare
|
Michael Montalbo wrote on the Git mailing list (how to reply to this email): On Thu, May 21, 2026 at 10:29 PM Junio C Hamano <gitster@pobox.com> wrote:
>
> "Michael Montalbo via GitGitGadget" <gitgitgadget@gmail.com> writes:
>
> > This series adds diff.<driver>.process, a long-running subprocess protocol
> > that lets external tools provide hunks to git's diff and blame pipelines.
> >
> > Over the past 18 years, git's diff pipeline accumulated many features that
> > operate on hunks: word diff, function context, color-moved, indent
> > heuristic, blame. External tools can replace the pipeline entirely
> > (diff.<driver>.command) or select among builtin algorithms
> > (diff.<driver>.algorithm), but there is no way for a tool to provide
> > line-change information into the pipeline. Tools that understand code
> > structure (tree-sitter parsers, format-aware analyzers, tools like
> > Difftastic and Mergiraf) must bypass git's pipeline and lose access to
> > everything downstream.
> >
> > The protocol follows filter.<driver>.process: pkt-line over stdin/stdout,
> > capability negotiation, one tool invocation per git command. The tool
> > receives file pairs and returns hunk descriptors that git feeds into the
> > standard xdiff pipeline. All output features work normally.
> >
> > Zero hunks with status=success means the tool considers the files
> > equivalent. git diff shows no output for the file, and git blame skips the
> > commit, attributing lines to earlier commits.
> >
> > On error or tool crash, git falls back silently to the builtin diff
> > algorithm. The feature is opt-in via diff.<driver>.process and
> > .gitattributes; unconfigured files are unaffected.
> >
> > The series includes git diff-process-normalize, a built-in tool that
> > compares files line by line ignoring whitespace (same logic as "git diff -w"
> > via xdiff_compare_lines):
>
> Interesting.
>
> If the goal is purely to normalize content before comparison
> (e.g. stripping comments or canonicalizing formatting), we already
> have the `textconv` mechanism. While `textconv` is a "one-shot"
> per-file process, it is significantly simpler.
>
> I suspect, however, that the primary focus here is to allow external
> tools to provide structural alignment (e.g. for AST- aware diffs
> like Difftastic or Mergiraf) without losing the original content in
> the display. Unlike `textconv`, which transforms the text the user
> sees, this protocol lets the display remain identical to the source
> while using a custom engine for the line-matching logic.
>
> If that is the intent, it should be stated more explicitly in the
> documentation and commit messages. The "whitespace-normalize"
> demonstration in [PATCH 5/5] is misleading because it's exactly the
> case where `textconv` would be sufficient.
>
Thank you for looking at this.
Yes, you have correctly identified the primary focus. My intention with the
whitespace normalization example was to provide a kind of "hello world"
diff process that would showcase how such a tool could interact with
the pipeline further down (i.e., blame vs diff output). However, I do agree
that it is a confusing example because it seems to clash with something
`textconv` already provides. I will update messaging across the series to
make the true intention of these changes more clear.
> I am afraid that the use of a long-running subprocess for every
> diff/blame invocation adds significant complexity and overhead. In
> particular, wouldn't the `blame` implementation performs a
> round-trip to the subprocess for every commit in the history? Even
> with a persistent process, the overhead of serializing and
> deserializing the entire file content twice (old and new) for every
> commit could be prohibitive for large files or deep histories.
>
> So, I dunno.
I hear you on this point. Anecdotally, my measurement of running blame
on diff.c:
Performance:
without process: 0.57s
with process: 0.67s
Blame attribution:
without process: 726 unique commits
with process: 721 unique commits (5 whitespace-only skipped)
Skipped commits:
0ea7d5b6f8 diff.c: fix some recent whitespace style violations
2775d92c53 diff.c: stylefix
4b25d091ba Fix a bunch of pointer declarations (codestyle)
a6080a0a44 War on whitespace
eeefa7c90e Style fixes, add a space after if/for/while.
I was imagining we could potentially optimize performance by extending the
protocol to enable passing OIDs as a capability so tools could read objects
directly without needing any serialization/deserialization. |
73688bd to
39ff53a
Compare
|
/submit |
|
Submitted as pull.2120.v2.git.1779733799.gitgitgadget@gmail.com To fetch this version into To fetch this version to local tag |
| @@ -218,6 +218,14 @@ endif::git-diff[] | |||
| Set this option to `true` to make the diff driver cache the text | |||
There was a problem hiding this comment.
Junio C Hamano wrote on the Git mailing list (how to reply to this email):
"Michael Montalbo via GitGitGadget" <gitgitgadget@gmail.com> writes:
> +struct diff_subprocess {
> + struct subprocess_entry subprocess;
> + unsigned int supported_capabilities;
> +};
> +
> +static int subprocess_map_initialized;
> +static struct hashmap subprocess_map;
Can we avoid introducing new global variables like these? Would
"struct userdiff_driver" or "struct diff_options" be a good place to
hang this hashmap, perhaps?
> +static int send_file_content(int fd, const char *buf, long size)
> +{
> + int ret;
> +
> + if (size > 0)
> + ret = write_packetized_from_buf_no_flush(buf, size, fd);
> + else
> + ret = 0;
Shouldn't "size == -24" be flagged as an invalid input?
> + if (ret)
> + return ret;
> + return packet_flush_gently(fd);
> +}
> +static int parse_hunk_line(const char *line, struct xdl_hunk *hunk)
> +{
> +...
> +}
This gives a silent error diagnosis, which is good for a lower level
helper.
> +int diff_process_get_hunks(struct userdiff_driver *drv,
> + const char *path,
> + const char *old_buf, long old_size,
> + const char *new_buf, long new_size,
> + struct xdl_hunk **hunks_out,
> + size_t *nr_hunks_out)
> +{
> + struct diff_subprocess *backend;
> + struct child_process *process;
> + int fd_in, fd_out;
> + struct strbuf status = STRBUF_INIT;
> + struct xdl_hunk *hunks = NULL;
> + struct xdl_hunk hunk;
> + size_t nr_hunks = 0, alloc_hunks = 0;
> + int len;
> + char *line;
> +
> + if (!drv || !drv->process)
> + return -1;
A driver that does not define process is not an error; it is
perfectly normal in the current world order where nobody has such an
external process and even fi this patch lands, external processes
are optional. So here "return -1" does not mean an error, and
silent return is perfectly fine.
> + backend = find_or_start_process(drv->process);
> + if (!backend)
> + return -1;
This is probably an error; the user specified drv->process, we
either tried to find or start the process and failed. Isn't it an
event that deserves to be reported in an error message?
> + if (!(backend->supported_capabilities & CAP_HUNKS))
> + return -1;
Backend started, but the "hunks" feature is not supported. Perhaps
in a year or two, this external process protocol may have become so
popular that it gained more capabilities, possibly making get_hunks
obsolete. We may be looking at such an external process that uses
other capabilities but not this one. This is not an error, so
silent return is perfectly fine.
> + process = subprocess_get_child_process(&backend->subprocess);
> + fd_in = process->in;
> + fd_out = process->out;
> +
> + /* Send request */
> + if (packet_write_fmt_gently(fd_in, "command=hunks\n") ||
> + packet_write_fmt_gently(fd_in, "pathname=%s\n", path) ||
> + packet_flush_gently(fd_in))
> + goto error;
> +
> + /* Send old file content */
> + if (send_file_content(fd_in, old_buf, old_size))
> + goto error;
> +
> + /* Send new file content */
> + if (send_file_content(fd_in, new_buf, new_size))
> + goto error;
> +
> + /* Read hunks until flush packet */
> + while ((len = packet_read_line_gently(fd_out, NULL, &line)) >= 0 &&
> + line) {
> + if (parse_hunk_line(line, &hunk) < 0)
> + goto error;
> + ALLOC_GROW(hunks, nr_hunks + 1, alloc_hunks);
> + hunks[nr_hunks++] = hunk;
> + }
> + if (len < 0)
> + goto error;
> +
> + /* Read status */
> + if (subprocess_read_status(fd_out, &status))
> + goto error;
> +
> + if (strcmp(status.buf, "success")) {
> + if (!strcmp(status.buf, "abort"))
> + backend->supported_capabilities &= ~CAP_HUNKS;
> + goto error;
> + }
> +
> + *hunks_out = hunks;
> + *nr_hunks_out = nr_hunks;
> + strbuf_release(&status);
> + return 0;
> +
> +error:
All exceptions that lead here look like events that should be
reported to the end-user.
> + free(hunks);
> + strbuf_release(&status);
> + return -1;
> +}
> +/*
> + * Query a diff process for hunks describing the changes
> + * between old_buf and new_buf.
> + *
> + * The backend is a long-running subprocess configured via
> + * diff.<driver>.process. It receives file content via
> + * pkt-line and returns hunks with 1-based line numbers.
> + *
> + * On success, sets *hunks_out and *nr_hunks_out to a newly allocated
> + * array (caller must free) and returns 0.
> + *
> + * On failure, returns -1. The caller should fall back to the
> + * builtin diff algorithm.
> + */
I do not agree with this. If it is a failure, the user should fix
the external process (or disable). It shouldn't be hidden behind a
fallback. As I left comments, in this round of implementation,
there are conditions that returns -1 for soemthing that is not an
error (i.e., not configured, or process not supporting the
particular capability) *and* in those cases the caller should fall
back as if nothing happened. But some error cases, the caller
should't hide them.There was a problem hiding this comment.
Michael Montalbo wrote on the Git mailing list (how to reply to this email):
On Mon, May 25, 2026 at 6:56 PM Junio C Hamano <gitster@pobox.com> wrote:
>
> "Michael Montalbo via GitGitGadget" <gitgitgadget@gmail.com> writes:
>
> > +struct diff_subprocess {
> > + struct subprocess_entry subprocess;
> > + unsigned int supported_capabilities;
> > +};
> > +
> > +static int subprocess_map_initialized;
> > +static struct hashmap subprocess_map;
>
> Can we avoid introducing new global variables like these? Would
> "struct userdiff_driver" or "struct diff_options" be a good place to
> hang this hashmap, perhaps?
>
Will clean this up.
> > +static int send_file_content(int fd, const char *buf, long size)
> > +{
> > + int ret;
> > +
> > + if (size > 0)
> > + ret = write_packetized_from_buf_no_flush(buf, size, fd);
> > + else
> > + ret = 0;
>
> Shouldn't "size == -24" be flagged as an invalid input?
>
Will fix and do a broader audit of input validation and bounds checking.
> > + if (ret)
> > + return ret;
> > + return packet_flush_gently(fd);
> > +}
>
> > +static int parse_hunk_line(const char *line, struct xdl_hunk *hunk)
> > +{
> > +...
> > +}
>
> This gives a silent error diagnosis, which is good for a lower level
> helper.
>
> > +int diff_process_get_hunks(struct userdiff_driver *drv,
> > + const char *path,
> > + const char *old_buf, long old_size,
> > + const char *new_buf, long new_size,
> > + struct xdl_hunk **hunks_out,
> > + size_t *nr_hunks_out)
> > +{
> > + struct diff_subprocess *backend;
> > + struct child_process *process;
> > + int fd_in, fd_out;
> > + struct strbuf status = STRBUF_INIT;
> > + struct xdl_hunk *hunks = NULL;
> > + struct xdl_hunk hunk;
> > + size_t nr_hunks = 0, alloc_hunks = 0;
> > + int len;
> > + char *line;
> > +
> > + if (!drv || !drv->process)
> > + return -1;
>
> A driver that does not define process is not an error; it is
> perfectly normal in the current world order where nobody has such an
> external process and even fi this patch lands, external processes
> are optional. So here "return -1" does not mean an error, and
> silent return is perfectly fine.
>
> > + backend = find_or_start_process(drv->process);
> > + if (!backend)
> > + return -1;
>
> This is probably an error; the user specified drv->process, we
> either tried to find or start the process and failed. Isn't it an
> event that deserves to be reported in an error message?
>
> > + if (!(backend->supported_capabilities & CAP_HUNKS))
> > + return -1;
>
> Backend started, but the "hunks" feature is not supported. Perhaps
> in a year or two, this external process protocol may have become so
> popular that it gained more capabilities, possibly making get_hunks
> obsolete. We may be looking at such an external process that uses
> other capabilities but not this one. This is not an error, so
> silent return is perfectly fine.
>
> > + process = subprocess_get_child_process(&backend->subprocess);
> > + fd_in = process->in;
> > + fd_out = process->out;
> > +
> > + /* Send request */
> > + if (packet_write_fmt_gently(fd_in, "command=hunks\n") ||
> > + packet_write_fmt_gently(fd_in, "pathname=%s\n", path) ||
> > + packet_flush_gently(fd_in))
> > + goto error;
> > +
> > + /* Send old file content */
> > + if (send_file_content(fd_in, old_buf, old_size))
> > + goto error;
> > +
> > + /* Send new file content */
> > + if (send_file_content(fd_in, new_buf, new_size))
> > + goto error;
> > +
> > + /* Read hunks until flush packet */
> > + while ((len = packet_read_line_gently(fd_out, NULL, &line)) >= 0 &&
> > + line) {
> > + if (parse_hunk_line(line, &hunk) < 0)
> > + goto error;
> > + ALLOC_GROW(hunks, nr_hunks + 1, alloc_hunks);
> > + hunks[nr_hunks++] = hunk;
> > + }
> > + if (len < 0)
> > + goto error;
> > +
> > + /* Read status */
> > + if (subprocess_read_status(fd_out, &status))
> > + goto error;
> > +
> > + if (strcmp(status.buf, "success")) {
> > + if (!strcmp(status.buf, "abort"))
> > + backend->supported_capabilities &= ~CAP_HUNKS;
> > + goto error;
> > + }
> > +
> > + *hunks_out = hunks;
> > + *nr_hunks_out = nr_hunks;
> > + strbuf_release(&status);
> > + return 0;
> > +
> > +error:
>
> All exceptions that lead here look like events that should be
> reported to the end-user.
>
Agreed on all points. I will restructure things so errors are flagged when
appropriate (i.e., user specified a process but one was not found / couldn't
start and exceptions) and non-errors are treated as they should be.
> > + free(hunks);
> > + strbuf_release(&status);
> > + return -1;
> > +}
>
> > +/*
> > + * Query a diff process for hunks describing the changes
> > + * between old_buf and new_buf.
> > + *
> > + * The backend is a long-running subprocess configured via
> > + * diff.<driver>.process. It receives file content via
> > + * pkt-line and returns hunks with 1-based line numbers.
> > + *
> > + * On success, sets *hunks_out and *nr_hunks_out to a newly allocated
> > + * array (caller must free) and returns 0.
> > + *
> > + * On failure, returns -1. The caller should fall back to the
> > + * builtin diff algorithm.
> > + */
>
> I do not agree with this. If it is a failure, the user should fix
> the external process (or disable). It shouldn't be hidden behind a
> fallback. As I left comments, in this round of implementation,
> there are conditions that returns -1 for soemthing that is not an
> error (i.e., not configured, or process not supporting the
> particular capability) *and* in those cases the caller should fall
> back as if nothing happened. But some error cases, the caller
> should't hide them.
Will address in a follow-up.
Thank you for the feedback!| @@ -218,6 +218,14 @@ endif::git-diff[] | |||
| Set this option to `true` to make the diff driver cache the text | |||
There was a problem hiding this comment.
Junio C Hamano wrote on the Git mailing list (how to reply to this email):
"Michael Montalbo via GitGitGadget" <gitgitgadget@gmail.com> writes:
> Zero hunks with status=success means the tool considers the
> files equivalent. Git skips diff output for that file.
Is "zero hunk" a common word or some random string you invented? If
the latter, which is I am assuming it to be, you should define what
it means at/before the first use. Here in the proposed log message,
and ...
>
> Signed-off-by: Michael Montalbo <mmontalbo@gmail.com>
> ---
> Documentation/config/diff.adoc | 8 +
> Documentation/gitattributes.adoc | 40 ++++
> Makefile | 1 +
> diff-process.c | 206 +++++++++++++++++++
> diff-process.h | 28 +++
> diff.c | 23 +++
> t/.gitattributes | 1 +
> t/t4080-diff-process.sh | 338 +++++++++++++++++++++++++++++++
> 8 files changed, 645 insertions(+)
> create mode 100644 diff-process.c
> create mode 100644 diff-process.h
> create mode 100755 t/t4080-diff-process.sh
>
> diff --git a/Documentation/config/diff.adoc b/Documentation/config/diff.adoc
> index 1135a62a0a..4ab5f60df6 100644
> --- a/Documentation/config/diff.adoc
> +++ b/Documentation/config/diff.adoc
> @@ -218,6 +218,14 @@ endif::git-diff[]
> Set this option to `true` to make the diff driver cache the text
> conversion outputs. See linkgit:gitattributes[5] for details.
>
> +`diff.<driver>.process`::
> + The command to run as a long-running diff process.
> + The tool communicates via the pkt-line protocol and returns
> + hunks that are fed into Git's diff and blame pipelines.
> + If the tool returns zero hunks, the file is treated as
> + unchanged for both diff output and blame attribution.
> + See linkgit:gitattributes[5] for details.
... also here.
I do not know if you mean "the tool returns no hunks" (there is no
"hunk <old_start> <old_count> <new_start> <new_count>" line passed
from the tool over the protocol) or "the tool returns zero-hunk"
(there is a special "zero-hunk" message to signal this particular
condition sent over the protocol), and this description does not
quite help disambiguating between the two.
If the former, then avoid "zero hunks" as it sounds like a noun with
special meaning. Yes, we can say "tool returns one hunk", "tool
returns 31 hunks", etc., so "tool returns zero hunks" may logically
be correct, but "when the tool returns no hunks with status=success"
is much less confusing, I think.There was a problem hiding this comment.
Michael Montalbo wrote on the Git mailing list (how to reply to this email):
On Mon, May 25, 2026 at 7:26 PM Junio C Hamano <gitster@pobox.com> wrote:
>
> "Michael Montalbo via GitGitGadget" <gitgitgadget@gmail.com> writes:
>
> > Zero hunks with status=success means the tool considers the
> > files equivalent. Git skips diff output for that file.
>
> Is "zero hunk" a common word or some random string you invented? If
> the latter, which is I am assuming it to be, you should define what
> it means at/before the first use. Here in the proposed log message,
> and ...
>
> >
> > Signed-off-by: Michael Montalbo <mmontalbo@gmail.com>
> > ---
> > Documentation/config/diff.adoc | 8 +
> > Documentation/gitattributes.adoc | 40 ++++
> > Makefile | 1 +
> > diff-process.c | 206 +++++++++++++++++++
> > diff-process.h | 28 +++
> > diff.c | 23 +++
> > t/.gitattributes | 1 +
> > t/t4080-diff-process.sh | 338 +++++++++++++++++++++++++++++++
> > 8 files changed, 645 insertions(+)
> > create mode 100644 diff-process.c
> > create mode 100644 diff-process.h
> > create mode 100755 t/t4080-diff-process.sh
> >
> > diff --git a/Documentation/config/diff.adoc b/Documentation/config/diff.adoc
> > index 1135a62a0a..4ab5f60df6 100644
> > --- a/Documentation/config/diff.adoc
> > +++ b/Documentation/config/diff.adoc
> > @@ -218,6 +218,14 @@ endif::git-diff[]
> > Set this option to `true` to make the diff driver cache the text
> > conversion outputs. See linkgit:gitattributes[5] for details.
> >
> > +`diff.<driver>.process`::
> > + The command to run as a long-running diff process.
> > + The tool communicates via the pkt-line protocol and returns
> > + hunks that are fed into Git's diff and blame pipelines.
> > + If the tool returns zero hunks, the file is treated as
> > + unchanged for both diff output and blame attribution.
> > + See linkgit:gitattributes[5] for details.
>
> ... also here.
>
> I do not know if you mean "the tool returns no hunks" (there is no
> "hunk <old_start> <old_count> <new_start> <new_count>" line passed
> from the tool over the protocol) or "the tool returns zero-hunk"
> (there is a special "zero-hunk" message to signal this particular
> condition sent over the protocol), and this description does not
> quite help disambiguating between the two.
>
> If the former, then avoid "zero hunks" as it sounds like a noun with
> special meaning. Yes, we can say "tool returns one hunk", "tool
> returns 31 hunks", etc., so "tool returns zero hunks" may logically
> be correct, but "when the tool returns no hunks with status=success"
> is much less confusing, I think.
Yes, "zero hunks" was my own invention and I see why it's confusing. Will
update the messaging to use "no hunks" instead and do a broader sweep of
the documentation to clarify the protocol and expected tool behavior.25f40d4 to
162b2cc
Compare
8814412 to
0093183
Compare
Add two new xpparam_t fields (external_hunks, external_hunks_nr) that let callers supply pre-computed hunks. When set, xdl_diff() populates the changed[] arrays from these hunks instead of running the diff algorithm, then continues through compaction and emission as usual. Validate supplied hunks before use: reject out-of-bounds line numbers, overlapping or out-of-order hunks, negative counts, and violations of the synchronization invariant (unchanged line counts must match between files). On validation failure, fall back to the builtin diff algorithm; this re-runs xdl_prepare_env() since the first call may have dirtied the changed[] arrays. Skip trim_common_tail() in xdi_diff() when external hunks are present, since external hunks reference line numbers in the original content. Signed-off-by: Michael Montalbo <mmontalbo@gmail.com>
Add a new per-driver configuration key that specifies the command for a long-running diff process. The name follows filter.<driver>.process: a long-running subprocess that stays alive across files within a single git invocation. Also add a diff_subprocess pointer to userdiff_driver, which the next commit uses to store the launched subprocess on the driver. Signed-off-by: Michael Montalbo <mmontalbo@gmail.com>
Extract subprocess_start_command() from subprocess_start() so that callers can start and handshake with a long-running subprocess without inserting it into a hashmap. Similarly, extract subprocess_stop_command() from subprocess_stop(). subprocess_start() becomes a thin wrapper: start_command + hashmap insertion. This enables callers that manage subprocess lifetime through their own data structures (e.g., per-driver pointers) to reuse the child process setup and handshake machinery without maintaining a parallel hashmap. No functional change for existing callers. Signed-off-by: Michael Montalbo <mmontalbo@gmail.com>
0093183 to
abbd622
Compare
Add support for external diff processes that communicate via the
long-running process protocol (pkt-line over stdin/stdout).
A diff process is configured per userdiff driver:
[diff "cdiff"]
process = /path/to/diff-tool
The tool provides custom line-matching: it receives file pairs
and returns hunks that reference line numbers in the content.
When textconv is also configured, the tool receives the
textconv-transformed content. The tool controls which lines
are marked as changed while the display shows the file content.
Patch output features (word diff, function context, color) work
normally; summary formats like --stat use their own diff path
and are not affected.
The handshake negotiates version=1 and capability=hunks. Per-file
requests send command=hunks, pathname, and both file contents as
packetized data. The tool responds with hunk lines and a status
packet (success, error, or abort). On error, git warns and falls
back to the builtin diff algorithm for that file. On abort, git
silently falls back for the current file and stops sending further
requests to the tool for the remainder of the session.
When the tool returns no hunks with status=success, git treats the
file as having no changes and produces no diff output. This also
means --exit-code reports no changes for that file.
The subprocess is stored on the userdiff_driver struct and launched
on first use, avoiding new global variables. If the process fails
to start, the handshake fails, or a communication error occurs
mid-stream, the failure is cached on the driver to avoid retrying
and re-warning on every subsequent file.
diff_process_fill_hunks() is the sole public entry point. It
handles driver lookup, flag checks, subprocess management, and
error reporting, returning an enum that lets callers distinguish
"hunks populated" from "files equivalent" from "not applicable"
from "tool failure."
Signed-off-by: Michael Montalbo <mmontalbo@gmail.com>
Make --no-ext-diff disable diff.<driver>.process in addition to diff.<driver>.command. The two mechanisms serve the same purpose (external diff tools), so a single flag should control both. Replace the OPT_BOOL for --ext-diff with an OPT_CALLBACK that sets both allow_external and the new no_diff_process flag, keeping the option's user-facing behavior unchanged while extending it to cover the diff process. Passing --ext-diff explicitly clears no_diff_process, so a later --ext-diff overrides an earlier --no-ext-diff. Disable the diff process unconditionally in format-patch so that generated patches are always based on the builtin diff algorithm and can be applied reliably by recipients who do not have the external tool. Document that --diff-algorithm also bypasses the diff process, since it sets ignore_driver_algorithm which diff_process_fill_hunks already checks. Signed-off-by: Michael Montalbo <mmontalbo@gmail.com>
When a diff process is configured via diff.<driver>.process, consult it during blame's per-commit diffing. If the process returns no hunks for a commit's changes to a file, treat the commit as having no changes, causing blame to attribute lines to earlier commits. The consultation happens at the pass_blame_to_parent() callsite using diff_process_fill_hunks(), matching how builtin_diff() in diff.c uses the same function. The existing diff_hunks() helper is unchanged; the callsite sets up xdi_diff() directly when external hunks are involved. The copy-detection callsite is unaffected since it does not use the diff process. The subprocess is long-running (one startup cost amortized across the blame traversal), but each commit in the file's history incurs a round-trip to the tool. Signed-off-by: Michael Montalbo <mmontalbo@gmail.com>
abbd622 to
1ad831a
Compare
This series adds
diff.<driver>.process, a long-running subprocessprotocol that lets external tools provide hunks into git's diff and
blame pipelines.
diff.<driver>.processopens the door to integrating content-awarelogic that don't exist inside git: structural diff tools, format-aware
analyzers, and other tools that understand the semantics of the
content being tracked. Where
diff.<driver>.commandreplacesthe diff pipeline entirely,
diff.<driver>.processfeeds hunks into it,so all downstream features (word diff, function context, color-moved,
stat, blame) work normally.
The protocol follows
filter.<driver>.process: pkt-line overstdin/stdout, capability negotiation, one tool invocation per
git command. The tool receives file pairs and returns hunk
descriptors that git feeds into the standard xdiff pipeline.
Zero hunks with
status=successmeans the tool considers thefiles equivalent. git diff shows no output for the file, and
git blame skips the commit, attributing lines to earlier
commits.
On error or tool crash, git falls back silently to the builtin
diff algorithm. The feature is opt-in via
diff.<driver>.processand .gitattributes; unconfigured files are unaffected.
The blame integration calls the diff process for each commit
in the file's history. The subprocess is long-running (one
startup amortized across the traversal), but per-commit
round-trips add latency. A natural follow-up would be a
capability that sends blob OIDs instead of content, allowing
tools that can read the object store directly to avoid the
cost of serializing and deserializing blob content over the
pipe for each file pair.
Changes since v1:
obscured the main use case for the RFC and update series
messaging accordingly.
helper in xprepare.c.
through to the regular diff algorithm on validation failure
instead of duplicating diff logic on fallback then returning.